In this section we document the parallel FFTW routines for
shared-memory threads on SMP hardware. These routines, which support
parallel one- and multi-dimensional transforms of both real and complex
data, are the easiest way to take advantage of multiple processors with
FFTW. They work just like the corresponding uniprocessor transform
routines, except that they take the number of parallel threads to use
as an extra parameter. Any program that uses the uniprocessor FFTW can
be trivially modified to use the multi-threaded FFTW.
* Menu:
* Installation and Supported Hardware/Software::
* Usage of Multi-threaded FFTW::
* How Many Threads to Use?::
* Using Multi-threaded FFTW in a Multi-threaded Program::
* Tips for Optimal Threading::
File: fftw.info, Node: Installation and Supported Hardware/Software, Next: Usage of Multi-threaded FFTW, Prev: Multi-threaded FFTW, Up: Multi-threaded FFTW
Installation and Supported Hardware/Software
--------------------------------------------
All of the FFTW threads code is located in the `threads'
subdirectory of the FFTW package. On Unix systems, the FFTW threads
libraries and header files can be automatically configured, compiled,
and installed along with the uniprocessor FFTW libraries simply by
including `--enable-threads' in the flags to the `configure' script
(*note Installation on Unix::.). (Note also that the threads routines,
when enabled, are automatically tested by the `make check'
self-tests.)
The threads routines require your operating system to have some sort
of shared-memory threads support. Specifically, the FFTW threads
package works with POSIX threads (available on most Unix variants,
including Linux), Solaris threads, BeOS (http://www.be.com) threads
(tested on BeOS DR8.2), Mach C threads (reported to work by users), and
Win32 threads (reported to work by users). (There is also untested
code to use MacOS MP threads.) If you have a shared-memory machine
that uses a different threads API, it should be a simple matter of
programming to include support for it; see the file
`fftw_threads-int.h' for more detail.
SMP hardware is not required, although of course you need multiple
processors to get any benefit from the multithreaded transforms.
File: fftw.info, Node: Usage of Multi-threaded FFTW, Next: How Many Threads to Use?, Prev: Installation and Supported Hardware/Software, Up: Multi-threaded FFTW
Usage of Multi-threaded FFTW
----------------------------
Here, it is assumed that the reader is already familiar with the
usage of the uniprocessor FFTW routines, described elsewhere in this
manual. We only describe what one has to change in order to use the
multi-threaded routines.
First, instead of including `<fftw.h>' or `<rfftw.h>', you should
include the files `<fftw_threads.h>' or `<rfftw_threads.h>',
respectively.
Second, before calling any FFTW routines, you should call the
function:
int fftw_threads_init(void);
This function, which should only be called once (probably in your
`main()' function), performs any one-time initialization required to
use threads on your system. It returns zero if successful, and a
non-zero value if there was an error (in which case, something is
seriously wrong and you should probably exit the program).
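For example, a minimal error-checked initialization might look like the
following sketch (assuming FFTW 2.x with threads enabled):

```c
#include <stdio.h>
#include <stdlib.h>
#include <fftw_threads.h>

int main(void)
{
     /* One-time initialization of the threads support; a non-zero
        return value indicates a serious error, so we exit. */
     if (fftw_threads_init()) {
          fprintf(stderr, "fftw_threads_init failed\n");
          exit(EXIT_FAILURE);
     }

     /* ... create plans and compute transforms as usual ... */

     return 0;
}
```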
Third, when you want to actually compute the transform, you should
use one of the following transform routines instead of the ordinary FFTW
functions:
     fftw_threads(nthreads, plan, howmany, in, istride,
                  idist, out, ostride, odist);
     fftw_threads_one(nthreads, plan, in, out);
     fftwnd_threads(nthreads, plan, howmany, in, istride,
                    idist, out, ostride, odist);
     fftwnd_threads_one(nthreads, plan, in, out);
     rfftw_threads(nthreads, plan, howmany, in, istride,
                   idist, out, ostride, odist);
     rfftw_threads_one(nthreads, plan, in, out);
     rfftwnd_threads_real_to_complex(nthreads, plan, howmany, in,
                                     istride, idist, out, ostride, odist);
     rfftwnd_threads_complex_to_real(nthreads, plan, howmany, in,
                                     istride, idist, out, ostride, odist);
     rfftwnd_threads_one_real_to_complex(nthreads, plan, in, out);
     rfftwnd_threads_one_complex_to_real(nthreads, plan, in, out);
All of these routines take exactly the same arguments and have
exactly the same effects as their uniprocessor counterparts (i.e.
without the ``_threads'') *except* that they take one extra parameter,
`nthreads' (of type `int'), before the normal parameters.(1) The
`nthreads' parameter specifies the number of threads of execution to
use when performing the transform (actually, the maximum number of
threads).
For example, to parallelize a single one-dimensional transform of
complex data, instead of calling the uniprocessor `fftw_one(plan, in,
out)', you would call `fftw_threads_one(nthreads, plan, in, out)'.
Passing an `nthreads' of `1' means to use only one thread (the main
thread), and is equivalent to calling the uniprocessor routine.
Passing an `nthreads' of `2' means that the transform is potentially
parallelized over two threads (and two processors, if you have them),
and so on.
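Putting the pieces together, a minimal sketch of a parallel
one-dimensional complex transform might look as follows (assuming FFTW
2.x; the transform size and thread count here are purely illustrative):

```c
#include <fftw_threads.h>

#define N 1024      /* transform size (illustrative) */
#define NTHREADS 2  /* number of threads (illustrative) */

/* Assumes fftw_threads_init() has already been called once,
   e.g. at the start of main(). */
void transform(fftw_complex *in, fftw_complex *out)
{
     fftw_plan plan = fftw_create_plan(N, FFTW_FORWARD, FFTW_ESTIMATE);

     /* Same as fftw_one(plan, in, out), but potentially
        parallelized over NTHREADS threads. */
     fftw_threads_one(NTHREADS, plan, in, out);

     fftw_destroy_plan(plan);
}
```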
These are the only changes you need to make to your source code.
Calls to all other FFTW routines (plan creation, destruction, wisdom,
etcetera) are not parallelized and remain the same. (The same plans and
wisdom are used by both uniprocessor and multi-threaded transforms.)
Your arrays are allocated and formatted in the same way, and so on.
Programs using the parallel complex transforms should be linked with
`-lfftw_threads -lfftw -lm' on Unix. Programs using the parallel real
transforms should be linked with `-lrfftw_threads -lfftw_threads
-lrfftw -lfftw -lm'. You will also need to link with whatever library
is responsible for threads on your system (e.g. `-lpthread' on Linux).
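For instance, on Linux with gcc the link commands might look like the
following (the compiler name and source file are illustrative):

```shell
# Parallel complex transforms only:
gcc myprog.c -o myprog -lfftw_threads -lfftw -lm -lpthread

# Parallel real transforms as well:
gcc myprog.c -o myprog -lrfftw_threads -lfftw_threads \
    -lrfftw -lfftw -lm -lpthread
```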
---------- Footnotes ----------
(1) There is one exception: when performing one-dimensional in-place
transforms, the `out' parameter is always ignored by the multi-threaded
routines, instead of being used as a workspace if it is non-`NULL' as
in the uniprocessor routines. The multi-threaded routines always
allocate their own workspace (the size of which depends upon the number
of threads).
File: fftw.info, Node: How Many Threads to Use?, Next: Using Multi-threaded FFTW in a Multi-threaded Program, Prev: Usage of Multi-threaded FFTW, Up: Multi-threaded FFTW
How Many Threads to Use?
------------------------
There is a fair amount of overhead involved in spawning and
synchronizing threads, so the optimal number of threads to use depends
upon the size of the transform as well as on the number of processors
you have.
As a general rule, you don't want to use more threads than you have
processors. (Using more threads will work, but there will be extra
overhead with no benefit.) In fact, if the problem size is too small,
you may want to use fewer threads than you have processors.
You will have to experiment with your system to see what level of
parallelization is best for your problem size. Useful tools to help you
do this are the test programs that are automatically compiled along with
the threads libraries, `fftw_threads_test' and `rfftw_threads_test' (in
the `threads' subdirectory). These take the same arguments as the
other FFTW test programs (see `tests/README'), except that they also
take the number of threads to use as a first argument, and report the
parallel speedup in speed tests. For example,
fftw_threads_test 2 -s 128x128
will benchmark complex 128x128 transforms using two threads and
report the speedup relative to the uniprocessor transform.
For instance, on a 4-processor 200MHz Pentium Pro system running
Linux 2.2.0, we found that the "crossover" point at which 2 threads
became beneficial for complex transforms was about 4k points, while 4
threads became beneficial at 8k points.
File: fftw.info, Node: Using Multi-threaded FFTW in a Multi-threaded Program, Next: Tips for Optimal Threading, Prev: How Many Threads to Use?, Up: Multi-threaded FFTW
Using Multi-threaded FFTW in a Multi-threaded Program
-----------------------------------------------------
Except for the first argument, the parameters are identical to those
of `fftw2d_create_plan'. (There are also analogous
`fftwnd_mpi_create_plan' and `fftw3d_mpi_create_plan' functions.
Transforms of any rank greater than one are supported.) The first
argument is an MPI "communicator", which specifies the group of
processes that are to be involved in the transform; the standard
constant `MPI_COMM_WORLD' indicates all available processes.
Next, one has to allocate and initialize the data. This is somewhat
tricky, because the transform data is distributed across the processes
involved in the transform. It is discussed in detail by the next
section (*note MPI Data Layout::.).
The actual computation of the transform is performed by the function
`fftwnd_mpi', which differs somewhat from its uniprocessor equivalent
and is described by:
     void fftwnd_mpi(fftwnd_mpi_plan p,
                     int n_fields,
                     fftw_complex *local_data, fftw_complex *work,
                     fftwnd_mpi_output_order output_order);
There are several things to notice here:
* First of all, all `fftw_mpi' transforms are in-place: the output is
in the `local_data' parameter, and there is no need to specify
`FFTW_IN_PLACE' in the plan flags.
* The MPI transforms also only support a limited subset of the
`howmany'/`stride'/`dist' functionality of the uniprocessor
routines: the `n_fields' parameter is equivalent to
`howmany=n_fields', `stride=n_fields', and `dist=1'.
(Conceptually, the `n_fields' parameter allows you to transform an
array of contiguous vectors, each with length `n_fields'.)
`n_fields' is `1' if you are only transforming a single, ordinary
array.
* The `work' parameter is an optional workspace. If it is not
`NULL', it should be exactly the same size as the `local_data'
array. If it is provided, FFTW is able to use the built-in
`MPI_Alltoall' primitive for (often) greater efficiency at the
expense of extra storage space.
* Finally, the last parameter specifies whether the output data has
the same ordering as the input data (`FFTW_NORMAL_ORDER'), or if
it is transposed (`FFTW_TRANSPOSED_ORDER'). Leaving the data
transposed results in significant performance improvements due to
a saved communication step (needed to un-transpose the data).
Specifically, the first two dimensions of the array are
transposed, as is described in more detail by the next section.
The output of `fftwnd_mpi' is identical to that of the corresponding
uniprocessor transform. In particular, you should recall our
conventions for normalization and the sign of the transform exponent.
The same plan can be used to compute many transforms of the same
size. After you are done with it, you should deallocate it by calling
`fftwnd_mpi_destroy_plan'.
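Schematically, the lifetime of such a plan might look like the
following sketch (the transform size is illustrative; allocation of
`local_data' and `work' is discussed in the next section):

```c
#include <mpi.h>
#include <fftw_mpi.h>

void transform(fftw_complex *local_data, fftw_complex *work)
{
     /* 128 x 128 two-dimensional complex transform, distributed
        over all available processes. */
     fftwnd_mpi_plan plan = fftw2d_mpi_create_plan(MPI_COMM_WORLD,
                                                   128, 128,
                                                   FFTW_FORWARD,
                                                   FFTW_ESTIMATE);

     /* In-place transform: the output appears in local_data.  The
        work array may be NULL, but supplying one (the same size as
        local_data) lets FFTW use MPI_Alltoall for greater speed. */
     fftwnd_mpi(plan, 1, local_data, work, FFTW_NORMAL_ORDER);

     fftwnd_mpi_destroy_plan(plan);
}
```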
Important: The FFTW MPI routines must be called in the same order by
all processes involved in the transform. You should assume that they
all are blocking, as if each contained a call to `MPI_Barrier'.
Programs using the FFTW MPI routines should be linked with
`-lfftw_mpi -lfftw -lm' on Unix, in addition to whatever libraries are
required for MPI.
File: fftw.info, Node: MPI Data Layout, Next: Usage of MPI FFTW for Real Multi-dimensional Transforms, Prev: Usage of MPI FFTW for Complex Multi-dimensional Transforms, Up: MPI FFTW
MPI Data Layout
---------------
The transform data used by the MPI FFTW routines is "distributed": a
distinct portion of it resides with each process involved in the
transform. This allows the transform to be parallelized, for example,
over a cluster of workstations, each with its own separate memory, so
that you can take advantage of the total memory of all the processors
you are parallelizing over.
In particular, the array is divided according to the rows (first
dimension) of the data: each process gets a subset of the rows of the
data. (This is sometimes called a "slab decomposition.") One
consequence of this is that you can't take advantage of more processors
than you have rows (e.g. a `64x64x64' matrix can use at most 64
processors). This isn't usually much of a limitation, however, as each
processor needs a fair amount of data in order for the
parallel-computation benefits to outweigh the communication costs.
Below, the first dimension of the data will be referred to as ``x''
and the second dimension as ``y''.
FFTW supplies a routine to tell you exactly how much data resides on
the current process:
     void fftwnd_mpi_local_sizes(fftwnd_mpi_plan p,
                                 int *local_nx,
                                 int *local_x_start,
                                 int *local_ny_after_transpose,
                                 int *local_y_start_after_transpose,
                                 int *total_local_size);
Given a plan `p', the other parameters of this routine are set to
values describing the required data layout, as explained below.
`total_local_size' is the number of `fftw_complex' elements that you
must allocate for your local data (and workspace, if you choose).
(This value should, of course, be multiplied by `n_fields' if that
parameter to `fftwnd_mpi' is not `1'.)
The data on the current process has `local_nx' rows, starting at row
`local_x_start'. If `fftwnd_mpi' is called with
`FFTW_TRANSPOSED_ORDER' output, then `y' will be the first dimension of
the output, and the local `y' extent will be given by
`local_ny_after_transpose' and `local_y_start_after_transpose'.
Otherwise, the output has the same dimensions and layout as the input.
For instance, suppose you want to transform three-dimensional data of
size `nx x ny x nz'. Then, the current process will store a subset of
this data, of size `local_nx x ny x nz', where the `x' indices
correspond to the range `local_x_start' to `local_x_start+local_nx-1'
in the "real" (i.e. logical) array. If `fftwnd_mpi' is called with
`FFTW_TRANSPOSED_ORDER' output, then the result will be a `ny x nx x
nz' array, of which a `local_ny_after_transpose x nx x nz' subset is
stored on the current process (corresponding to `y' values starting at
`local_y_start_after_transpose').
The following is an example of allocating such a three-dimensional
array (`local_data') before the transform and initializing it to the
values of some function `f(x,y,z)':

     fftwnd_mpi_local_sizes(plan, &local_nx, &local_x_start,
                            &local_ny_after_transpose,
                            &local_y_start_after_transpose,
                            &total_local_size);

     local_data = (fftw_complex*) malloc(sizeof(fftw_complex) *
                                         total_local_size);

     for (x = 0; x < local_nx; ++x)
          for (y = 0; y < ny; ++y)
               for (z = 0; z < nz; ++z)
                    local_data[(x*ny + y)*nz + z]
                         = f(x + local_x_start, y, z);

Some important things to remember:
* Although the local data is of dimensions `local_nx x ny x nz' in
the above example, do *not* allocate the array to be of size
`local_nx*ny*nz'. Use `total_local_size' instead.
* The amount of data on each process will not necessarily be the
same; in fact, `local_nx' may even be zero for some processes.
(For example, suppose you are doing a `6x6' transform on four
processors. There is no way to effectively use the fourth
processor in a slab decomposition, so we leave it empty. Proof
left as an exercise for the reader.)
* All arrays are, of course, in row-major order (*note
Multi-dimensional Array Format::.).
* If you want to compute the inverse transform of the output of
`fftwnd_mpi', the dimensions of the inverse transform are given by
the dimensions of the output of the forward transform. For
example, if you are using `FFTW_TRANSPOSED_ORDER' output in the
above example, then the inverse plan should be created with
dimensions `ny x nx x nz'.
* The data layout only depends upon the dimensions of the array, not
on the plan, so you are guaranteed that different plans for the
same size (or inverse plans) will use the same (consistent) data
layouts.
File: fftw.info, Node: Usage of MPI FFTW for Real Multi-dimensional Transforms, Next: Usage of MPI FFTW for Complex One-dimensional Transforms, Prev: MPI Data Layout, Up: MPI FFTW
Usage of MPI FFTW for Real Multi-dimensional Transforms
-------------------------------------------------------